EDA & Predicting Sales of Summer Clothes on the E-commerce Platform Wish

Background

Studying top products requires more than just product listings. You also need to know what sells well and what does not.

This dataset contains product listings as well as product ratings and sales performance.

With this, you can start to look for correlations and patterns between the success of a product and its various attributes.

Task

Develop a model for product success: help businesses answer the crucial question "How well is a product likely to sell?"

i.e. build a model that can help predict how well a product is going to sell. This is a regression problem: predict the number of units sold.

Such a model has many implications and could be used in many different ways, the most straightforward being to adjust how much of a product should be kept in stock.

Inspiration (What factors will affect product success?)

  1. How about trying to validate the established idea of human sensitivity to price drops? (impact of discount on units sold)

  2. What are the top categories of products which sell best?

  3. Do bad products sell? What is the relationship between the quality of a product (ratings) and its success? Does the price factor into this?

  4. Does a seller's fame factor into top products? (merchant name)

  5. Does using ad boost help to sell more?

  6. Does the number of tags (making a product more discoverable) factor into the success of a product?

Data Source: https://www.kaggle.com/jmmvutu/summer-products-and-sales-in-ecommerce-wish

Initial Hypothesis

The following factors might have a positive impact on the number of units sold:

Steps

  1. Understanding the data
  2. EDA
    • Univariate Analysis - to look into the distribution of the independent variables
    • Bivariate Analysis - to answer the inspiration questions + look into the relationship between the independent variables & the target variable
  3. Handling of duplicated rows, missing values, outliers
  4. Model building and evaluation

1. Understanding the data

Null values
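A minimal sketch of how the null values could be summarised with pandas. The toy DataFrame and its values are made up for illustration; only `units_sold` and `price` are column names mentioned in this notebook, and `rating` is an assumed name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the product listings; values are invented for illustration.
df = pd.DataFrame({
    "price": [8.0, 5.5, np.nan, 12.0],
    "rating": [3.8, np.nan, np.nan, 4.2],   # assumed column name
    "units_sold": [100, 1000, 50, 20000],
})

# Count and share of missing values per column, most-affected columns first.
null_report = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": (df.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(null_report)
```

Columns with a high share of missing values are the ones to revisit in the missing-value handling step.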

2. EDA

Let's have a look into the independent variables:

Target variable: units_sold

2.1 Univariate Analysis

Categorical

Ordinal

Modify the size and color columns before doing visualization.
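One possible way to tidy the size and color columns is sketched below. The column names `product_variation_size_id` and `product_color`, the lists of known sizes and base colors, and the helper functions are all assumptions made for illustration, not the cleaning rules actually used here.

```python
import pandas as pd

# Toy values invented for illustration; column names are assumptions.
df = pd.DataFrame({
    "product_variation_size_id": ["S", "Size-XS", "size m", "XL.", None, "EU 35"],
    "product_color": ["Black", "navyblue", "WHITE", "black & white", None, "rose red"],
})

KNOWN_SIZES = ["XXS", "XS", "S", "M", "L", "XL", "XXL"]        # hypothetical list
BASE_COLORS = ["black", "white", "blue", "red", "green", "pink"]  # hypothetical list

def clean_size(raw):
    """Map a free-text size to a standard label, 'other', or 'unknown'."""
    if pd.isna(raw):
        return "unknown"
    size = str(raw).upper().replace("SIZE", "").strip(" -.")
    return size if size in KNOWN_SIZES else "other"

def clean_color(raw):
    """Map a free-text color to the first base color it contains."""
    if pd.isna(raw):
        return "unknown"
    text = str(raw).lower()
    for base in BASE_COLORS:
        if base in text:
            return base
    return "other"

df["size_clean"] = df["product_variation_size_id"].map(clean_size)
df["color_clean"] = df["product_color"].map(clean_color)
print(df[["size_clean", "color_clean"]])
```

Collapsing rare spellings into a few labels keeps the later bar charts and dummy variables manageable.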

Numerical

Units sold mostly range from about 100 to 20,000. There are still extreme cases: some products sell poorly (<100) while others sell extremely well (>50,000).

It seems that inventory_total takes almost only one value.

As shown, 'price' and 'retail_price' have outliers to handle.

Except for a few rows, the inventory value is 50, so inventory_total will not be considered for modelling. In practice, the variable would not need to be dropped if its values showed more variation.
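A quick check for near-constant columns such as inventory_total can be sketched as follows. The toy data, the helper `dominant_share`, and the 0.9 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: inventory_total is 50 in 19 of 20 rows, price varies freely.
df = pd.DataFrame({
    "inventory_total": [50] * 19 + [36],
    "price": list(range(1, 21)),
})

def dominant_share(series):
    """Fraction of rows taken by the single most frequent value."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Flag columns where one value covers at least 90% of rows (assumed threshold).
near_constant = [col for col in df.columns if dominant_share(df[col]) >= 0.9]
print(near_constant)
```

A column flagged this way carries little information for a model, which is the reason given above for leaving inventory_total out.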

2.2 Bivariate Analysis

Part 1. Inspiration Questions

1. What is the impact of the discount (price compared to the original retail_price) on the number of units sold?

There is no obvious relationship: discount has no strong impact on units_sold.

Even when we break the products down into different price ranges, no obvious relationship between discount and units sold appears.
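The discount and price-range breakdown can be sketched roughly as below. The discount formula, the clipping of negative discounts to zero, and the bin edges are assumptions; the toy numbers are invented.

```python
import pandas as pd

# Toy rows invented for illustration.
df = pd.DataFrame({
    "price": [5.0, 8.0, 12.0, 25.0],
    "retail_price": [10.0, 8.0, 48.0, 20.0],
    "units_sold": [1000, 100, 5000, 50],
})

# Discount as a percentage of the retail price; a price above retail is
# treated as "no discount" (an assumption) by clipping at zero.
df["discount_pct"] = (
    (df["retail_price"] - df["price"]) / df["retail_price"] * 100
).clip(lower=0)

# Price ranges with right-inclusive bins: (0, 10], (10, 20], (20, inf).
df["price_range"] = pd.cut(
    df["price"], bins=[0, 10, 20, float("inf")], labels=["0-10", "10-20", "20+"]
)

# Median units sold per price range, as one simple summary.
summary = df.groupby("price_range", observed=True)["units_sold"].median()
print(df[["discount_pct", "price_range"]])
print(summary)
```

Plotting `discount_pct` against `units_sold` within each `price_range` is then a matter of grouping or faceting on the new column.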

2. What are the top categories of products that sell best?

Dresses and tops are the best-selling items. Many categories look similar and are attached to the same product, so it is hard to distinguish products by category.

3. (Do bad products sell?) What is the relationship between the quality of a product (ratings) and units sold? Does the price affect the ratings?

=> The rating has to be above 3 to reach a higher number of units sold, which means bad products do not sell. A higher price does not necessarily mean a higher rating; lower-priced products tend to sell more units.

The result is clearer here: high units sold occur when the price falls in the 0-10 range, with ratings concentrated between 3 and 4.5.

Every rating count shows a similar pattern: as the count increases, units sold increase. This makes sense: the better a product sells, the more reviews it receives, whether good or bad, which is why the rating count increases with units sold.

4. Does a seller's fame factor into top products? i.e. is there any relationship between merchant name and units sold?

Some merchants have a high number of units sold. But it is not clear whether this is due to the seller's fame or simply because these merchants dominate the product range. Let's look at units sold per product.

The result is a bit different from the first approach, but some merchants do reach a high number of units sold per product.
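The two merchant views (total units sold versus units sold per product) can be sketched with a groupby. The column names `merchant_name` and `product_id` are assumptions, and the toy data is invented so that the two rankings disagree.

```python
import pandas as pd

# Toy data: merchant "a" lists four products, merchant "b" lists one.
df = pd.DataFrame({
    "merchant_name": ["a", "a", "a", "a", "b"],
    "product_id": ["p1", "p2", "p3", "p4", "p5"],
    "units_sold": [1000, 1000, 1000, 1000, 3000],
})

merchant_stats = df.groupby("merchant_name").agg(
    total_units=("units_sold", "sum"),
    n_products=("product_id", "nunique"),
)
# Normalise by catalogue size so large catalogues do not dominate.
merchant_stats["units_sold_per_product"] = (
    merchant_stats["total_units"] / merchant_stats["n_products"]
)

top_by_total = merchant_stats["total_units"].idxmax()
top_per_product = merchant_stats["units_sold_per_product"].idxmax()
print(merchant_stats)
```

In this toy case the merchant with the most total units is not the one with the most units per product, which is the distinction drawn above.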

5. Does using ad boost help to sell more?

This result was not expected. Using ad boost should help to sell more products; otherwise there is no point in using it. Let's break the products down by price range and look again.

It looks like ad boost works better when the product price is higher, especially for products above 20 dollars.

6. Does the number of tags (making a product more discoverable) factor into the success of a product?

It looks like the number of tags has no obvious relationship with the number of units sold, i.e. the number of tags does not factor into the success of a product.
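A sketch of how a tag count could be derived and compared with units sold, assuming the tags are stored as one comma-separated string in a column named `tags` (an assumption). The correlation is computed on ranks, which is equivalent to a Spearman correlation and less sensitive to the skew in units sold.

```python
import pandas as pd

# Toy rows invented for illustration; "tags" format is an assumption.
df = pd.DataFrame({
    "tags": ["summer,dress,women", "top", None, "summer, beach,,"],
    "units_sold": [100, 1000, 50, 5000],
})

def count_tags(raw):
    """Count non-empty, comma-separated tags in a string."""
    return len([tag for tag in raw.split(",") if tag.strip()])

df["n_tags"] = df["tags"].fillna("").apply(count_tags)

# Pearson correlation of the ranks, i.e. a Spearman rank correlation.
rank_corr = df["n_tags"].rank().corr(df["units_sold"].rank())
print(df[["n_tags", "units_sold"]])
print(rank_corr)
```

A rank correlation close to zero on the real data would support the conclusion that tag count does not drive sales.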

Part 2. Let's have a look into other factors

Categorical Independent Variable vs Target Variable ('units_sold_per_product')

Pattern seen:

No conclusion:

Ordinal Independent Variable vs Target Variable ('units_sold_per_product')

Pattern spotted:

No conclusion:

Numerical Independent Variable vs Target Variable 'units_sold_per_product'

Correlation Map
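A minimal sketch of the numbers behind a correlation map: select the numeric columns, compute pairwise Pearson correlations, and rank them against the target. The toy values and the `rating_count` and `merchant_name` column names are assumptions.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "units_sold": [100, 1000, 5000, 20000],
    "rating_count": [10, 120, 600, 2500],
    "price": [12.0, 8.0, 6.0, 5.0],
    "merchant_name": ["a", "b", "c", "d"],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute the correlation matrix.
corr = df.select_dtypes("number").corr()

# Correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    corr["units_sold"].drop("units_sold").sort_values(ascending=False)
)
print(corr_with_target)
```

The matrix `corr` can be passed to a heatmap function such as `seaborn.heatmap` to draw the map itself.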

Action points after EDA:

Numerical:

3. Duplicated rows, Missing values, Outlier handling

Drop the duplicated rows
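Dropping fully duplicated rows can be sketched as follows; the toy rows and the `product_id` column name are assumptions.

```python
import pandas as pd

# Toy data: the second and third rows are exact duplicates.
df = pd.DataFrame({
    "product_id": ["p1", "p2", "p2", "p3"],
    "price": [8.0, 5.0, 5.0, 12.0],
    "units_sold": [100, 1000, 1000, 50],
})

# Count rows that repeat an earlier row exactly, then drop them.
n_duplicates = int(df.duplicated().sum())
df_clean = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(df_clean))
```

If the same product can appear with slightly different field values, `drop_duplicates(subset=[...])` on an identifier column is an alternative worth considering.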

Drop the irrelevant variables

Missing values handling

Outlier handling

The distribution looks much closer to normal and the effect of extreme values has been significantly reduced.
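The notebook text does not say which outlier treatment was applied. One common option that produces the effect described, shown here purely as an assumption, is a log transform of right-skewed variables such as price.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices invented for illustration.
price = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 15, 40, 120], dtype=float, name="price")

# log1p compresses large values far more than small ones.
price_log = np.log1p(price)

skew_before = price.skew()
skew_after = price_log.skew()

# expm1 is the exact inverse, so values can be mapped back if needed.
restored = np.expm1(price_log)

print(skew_before, skew_after)
```

A lower skewness after the transform indicates that extreme values pull the distribution less; capping at IQR-based limits is another option if the original scale must be kept.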

4. Model building and Evaluation

Models selected: Linear Regression, Lasso, Random Forest, XGBoost

Cross Validation method: K-Fold Cross Validation

Evaluation metric: R-Squared Score
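The evaluation setup (K-fold cross validation scored by R-squared) can be sketched with scikit-learn. The synthetic data, the five folds, and the random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the prepared feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Five shuffled folds; each fold is held out once for scoring.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

mean_r2 = scores.mean()
print(scores.round(3), round(mean_r2, 3))
```

Using the same `cv` object for every model keeps the fold splits identical, so the R-squared scores are directly comparable.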

Make dummy variables for the categorical variable
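Dummy variables can be created with `pandas.get_dummies`. The toy data and the choice of `product_color` and `uses_ad_boosts` as columns are assumptions; `drop_first=True` is also an assumption, chosen because it avoids perfectly collinear columns in the linear models.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "product_color": ["black", "white", "black", "red"],
    "uses_ad_boosts": [0, 1, 0, 1],
    "units_sold": [100, 1000, 50, 5000],
})

# One indicator column per category, dropping the first (alphabetical) level.
encoded = pd.get_dummies(df, columns=["product_color"], drop_first=True)

print(encoded.columns.tolist())
```

Here "black" becomes the reference level, so a row with both indicator columns at zero is a black product.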

Prepare the data for modelling

A. Linear Regression

B. Lasso Regression

C. Random Forest

Random forest is a tree-based bootstrapping (bagging) algorithm in which a number of individual learners (decision trees) are combined to make a powerful prediction model. For every individual learner, a random sample of rows and a few randomly chosen variables are used to build a decision tree. The final prediction is a function of all the predictions made by the individual learners; for regression problems, it is typically their mean.

We will try to improve the accuracy by tuning the hyperparameters of this model, using grid search to find good values. Grid search selects the best of a family of hyperparameter settings, parametrized by a grid of candidate values.

We will tune the max_depth and n_estimators parameters: max_depth sets the maximum depth of each tree, and n_estimators sets the number of trees used in the random forest model.
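A minimal grid-search sketch over max_depth and n_estimators with scikit-learn. The synthetic data, the candidate values in the grid, and the three folds are assumptions kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=150)

# Every combination in the grid is cross-validated and scored by R-squared.
param_grid = {"max_depth": [3, 6], "n_estimators": [20, 50]}
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="r2",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the forest refit on all the data with the best parameter combination.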

Feature Importance

To find which features are most important to this problem
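Feature importances can be read from a fitted random forest as sketched below. The data is synthetic and deliberately built so that `rating_count` drives the target; the feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; the target depends mostly on rating_count by design.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "rating_count": rng.normal(size=300),
    "price": rng.normal(size=300),
    "rating": rng.normal(size=300),
})
y = 5.0 * X["rating_count"] + 0.5 * X["price"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour high-cardinality features, so `sklearn.inspection.permutation_importance` is a useful cross-check.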

D. XGBoost

XGBoost is a fast and efficient boosting algorithm. It works only with numeric variables, and we have already replaced the categorical variables with numeric ones. Let's have a look at the parameters that we are going to use in our model.

(will apply the same setting as random forest)

Compare the R-Squared scores of different models

The higher the score, the better the model.

Conclusion

Based on the performance above, the Random Forest model performs best, with the highest R-Squared score, and is recommended for forecasting units sold for each product.